Optimization Algorithms
Convex optimization
Gradient descent
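A minimal sketch of plain gradient descent in Python (the quadratic objective, learning rate, and step count are illustrative choices, not from these notes):

```python
def gradient_descent(grad, x0, lr=0.1, steps=50):
    """Repeatedly step against the gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Example: minimize f(x) = x^2, whose gradient is 2x.
x_star = gradient_descent(lambda x: 2 * x, x0=5.0)
print(x_star)  # close to 0, the minimizer of x^2
```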
Optimization challenges
Common challenges
- local minima
- saddle points: any location where all gradients of a function vanish but which is neither a global nor a local minimum
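A quick numerical illustration of a saddle point, using the toy function f(x, y) = x^2 - y^2 (my example, not from the notes): the gradient vanishes at the origin, yet the origin is a minimum along x and a maximum along y, so it is neither a local nor a global minimum.

```python
import numpy as np

def f(x, y):
    return x**2 - y**2

def grad_f(x, y):
    return np.array([2 * x, -2 * y])

print(grad_f(0.0, 0.0))          # both components are 0 -> the gradient vanishes at the origin
print(f(0.1, 0.0), f(0.0, 0.1))  # 0.01 and -0.01 -> higher along x, lower along y
```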
Challenges in DL
- Almost all optimization problems arising in deep learning are nonconvex
Vanishing / Exploding gradients
- Vanishing gradients: the gradients of a neural network's loss function become extremely small, causing the weights to update very slowly and making it difficult for the network to learn effectively
- Exploding gradients: gradients become excessively large, leading to unstable updates and causing the model's parameters to oscillate or diverge during training
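A rough sketch of why both problems appear in deep networks: backpropagation multiplies many per-layer factors, so the gradient shrinks or grows roughly geometrically with depth (the scale factors 0.5 and 1.5 and the depth of 50 are illustrative assumptions):

```python
depth = 50
grad_small, grad_large = 1.0, 1.0
for _ in range(depth):
    grad_small *= 0.5   # each layer scales the gradient down -> vanishing
    grad_large *= 1.5   # each layer scales the gradient up   -> exploding

print(grad_small)  # ~8.9e-16: updates become negligible
print(grad_large)  # ~6.4e+08: updates become unstable
```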
How to speed up optimization in DL
- Normalizing inputs (zero mean, unit variance per feature; see the normalization sketch after this list)
- Careful weight initialization (see the initialization sketch after this list)
- for tanh: Xavier initialization
- for ReLU: He initialization
- Gradient checking (see the finite-difference sketch after this list)
- only use while debugging, not during training
- remember to include the regularization term in the cost being checked
- doesn't work with dropout; turn dropout off while checking
- Hyperparameter Tuning: Hyperparameter Tuning#Hyperparameter Tuning in Deep Learning
- Batch Normalization (see the batch norm sketch after this list)
- applies normalization to the inputs of hidden layers, not just to the network's inputs
- weakens the coupling between what the early layers' parameters have to do and what the later layers' parameters have to do, so each layer can learn a bit more independently of the others; this speeds up learning in the whole network
- can add a slight regularization effect, because the per-mini-batch statistics add noise to the hidden-layer activations
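A minimal sketch of input normalization (zero mean, unit variance per feature), as referenced above; the small epsilon guarding against division by zero is my own choice:

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Standardize each feature of X (shape: n_samples x n_features)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(normalize_inputs(X))  # each column now has mean ~0 and std ~1
```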
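A sketch of the two initialization rules mentioned above, in their common "variance scaled by fan-in" form (exact variants differ across sources; the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: variance ~ 1 / fan_in, suited to tanh activations."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    """He: variance ~ 2 / fan_in, suited to ReLU activations."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W_tanh = xavier_init(256, 128)
W_relu = he_init(256, 128)
print(W_tanh.std(), W_relu.std())  # He weights are roughly sqrt(2) times wider
```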
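A sketch of gradient checking by centered finite differences, as referenced above; the quadratic cost is a stand-in for a real network's loss, and the check should include any regularization term and be run with dropout disabled:

```python
import numpy as np

def numerical_grad(J, theta, eps=1e-7):
    """Centered-difference approximation of dJ/dtheta, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        grad[i] = (J(plus) - J(minus)) / (2 * eps)
    return grad

J = lambda theta: np.sum(theta**2)      # toy cost; analytic gradient is 2 * theta
theta = np.array([1.0, -2.0, 3.0])
analytic = 2 * theta
numeric = numerical_grad(J, theta)

# Relative difference; values around 1e-7 or smaller suggest the backprop gradient is correct.
diff = np.linalg.norm(analytic - numeric) / (np.linalg.norm(analytic) + np.linalg.norm(numeric))
print(diff)
```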
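A minimal sketch of the batch norm forward pass at training time (per-feature statistics over the mini-batch; gamma, beta, and eps follow the usual convention, and the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize pre-activations Z (batch x features), then rescale and shift."""
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_hat = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * Z_hat + beta             # learnable scale and shift

Z = np.random.default_rng(0).normal(3.0, 10.0, size=(64, 4))
out = batch_norm_forward(Z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```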